Modelling Project Success by Alfi, Jiaying, Himanshu, Sara



Problem Solving Strategy

A typical machine learning project follows a general flow of analysis stages for building a predictive model. The steps followed in this analysis are:

  1. Understanding the problem domain
  2. Data Exploration and Preparation
  3. Feature Engineering
  4. Dimensionality Reduction (or Feature Selection)
  5. Model Evaluation
  6. Hyper-parameter Tuning
  7. Ensembling and Model Selection

STEP 1. Understanding the problem domain

  • Kickstarter - Maintains a global crowdfunding platform focused on creativity (films, music, stage shows, comics, journalism, video games, technology and food-related projects)
  • People who back Kickstarter projects are offered tangible rewards or experiences in exchange for their pledges.

Question: can we build a model that predicts whether a project will be successful, failed, or canceled from the given dataset? Possible predictive factors:

  • Total amount to be raised
  • Total duration of the project
  • Theme of the project
  • Writing style of the project description
  • Length of the project description
  • Project launch time

STEP 2. Data Exploration and Preparation

  • Verified the distinct values of each column
  • Class variable distribution (selected classes: canceled, failed, successful)
     - failed        52.22 %
     - successful    35.38 %
     - canceled      10.24 %
     - undefined      0.94 %
     - live           0.74 %
     - suspended      0.49 %
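The class distribution above comes from a percentage breakdown of the `state` column; a minimal sketch on toy data (the real frame would be loaded with `pd.read_csv` on the Kickstarter CSV):

```python
import pandas as pd

# Toy stand-in for the Kickstarter frame (only the target column).
df = pd.DataFrame({"state": ["failed"] * 5 + ["successful"] * 3 + ["canceled"] * 2})

# Percentage share of each class label in the target column.
dist = df["state"].value_counts(normalize=True).mul(100).round(2)
print(dist)
```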

Canceled state: about 10% of the projects in this dataset are in the canceled state. For example, the project owner may have secured funding elsewhere, or the project requirements changed and the owner recreated the crowdfunding campaign.

Since the dataset gives neither a clear reason for cancellation nor the date on which a project was canceled, the canceled state is treated as a separate class rather than merged with failed.

Total Projects:  378661 
Total Features:  15
Out[6]: (first five rows)

|   | ID | name | category | main_category | currency | deadline | goal | launched | pledged | state | backers | country | usd pledged | usd_pledged_real | usd_goal_real |
|---|----|------|----------|---------------|----------|----------|------|----------|---------|-------|---------|---------|-------------|------------------|---------------|
| 0 | 1000002330 | The Songs of Adelaide & Abullah | Poetry | Publishing | GBP | 2015-10-09 | 1000.0 | 2015-08-11 12:12:28 | 0.0 | failed | 0 | GB | 0.0 | 0.0 | 1533.95 |
| 1 | 1000003930 | Greeting From Earth: ZGAC Arts Capsule For ET | Narrative Film | Film & Video | USD | 2017-11-01 | 30000.0 | 2017-09-02 04:43:57 | 2421.0 | failed | 15 | US | 100.0 | 2421.0 | 30000.00 |
| 2 | 1000004038 | Where is Hank? | Narrative Film | Film & Video | USD | 2013-02-26 | 45000.0 | 2013-01-12 00:20:50 | 220.0 | failed | 3 | US | 220.0 | 220.0 | 45000.00 |
| 3 | 1000007540 | ToshiCapital Rekordz Needs Help to Complete Album | Music | Music | USD | 2012-04-16 | 5000.0 | 2012-03-17 03:24:11 | 1.0 | failed | 1 | US | 1.0 | 1.0 | 5000.00 |
| 4 | 1000011046 | Community Film Project: The Art of Neighborhoo... | Film & Video | Film & Video | USD | 2015-08-29 | 19500.0 | 2015-07-04 08:35:03 | 1283.0 | canceled | 14 | US | 1283.0 | 1283.0 | 19500.00 |

Data Cleaning and Noise Removal

  1. Drop unwanted columns (ID, goal, pledged, usd pledged, and currency)
  2. Remove duplicate rows, if any exist
  3. Handle missing values; in this case, delete the affected rows
  4. Remove noisy outliers with goal amounts above 2,200,000 (all of these failed)
  5. Remove the 6 rows with launch dates in 1970 and 2018 (implausible or out-of-range values)
  6. Clean misrepresented values such as "N,0"" in the country column as part of data cleaning

Note: the name column has 4 NaN values, while usd pledged has 3,797 NaN values. These rows can be removed directly, as the dataset is large enough for the analysis.
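The cleaning steps above can be sketched in pandas. The toy frame below fabricates one example of each problem (the column names match the dataset; the row values are invented for illustration):

```python
import pandas as pd

# One row per cleaning problem: a duplicate, a missing name, a bad country
# code, an extreme goal, and a 1970 placeholder launch date.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 7],
    "name": ["A", "B", "B", None, "C", "D", "E"],
    "country": ["US", "GB", "GB", "US", 'N,0"', "US", "US"],
    "usd_goal_real": [1000.0, 5000.0, 5000.0, 2000.0, 3000.0, 3_000_000.0, 500.0],
    "launched": ["2015-08-11", "2017-09-02", "2017-09-02", "2016-01-01",
                 "2014-05-05", "2015-01-01", "1970-01-01"],
})

df = df.drop(columns=["ID"])                             # 1. unwanted columns
df = df.drop_duplicates()                                # 2. duplicate rows
df = df.dropna(subset=["name"])                          # 3. missing values
df = df[df["usd_goal_real"] <= 2_200_000]                # 4. extreme goal amounts
df = df[pd.to_datetime(df["launched"]).dt.year > 1970]   # 5. placeholder launch dates
df = df[df["country"] != 'N,0"']                         # 6. malformed country codes
```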

  • Before Cleaning: (378661, 15)
  • After Cleaning: (369678, 10)

Distributions - Outliers and Skew

Numeric variables such as backers, usd_pledged_real, and usd_goal_real are highly right-skewed, largely because many failed projects have no backers and no pledged amount. This is addressed through data normalization while developing the model.

To explore these variables, they are log-transformed and histograms are plotted to visualize the distributions.

| Column | usd_goal_real_log | usd_pledged_real_log |
|--------|-------------------|----------------------|
| skew   | 12.765938 | 82.063085 |
| count  | 369678.000000 | 369678.000000 |
| mean   | 8.632460 | 5.775453 |
| std    | 1.671539 | 3.309677 |
| min    | 0.009950 | 0.000000 |
| 25%    | 7.601402 | 3.526361 |
| 50%    | 8.612685 | 6.456770 |
| 75%    | 9.662097 | 8.314587 |
| max    | 14.591996 | 16.828050 |

The minimum goal amount is as small as 0.01 USD.


Distributions of Monetary Columns against the Class Variable (State)

The amount values in the dataset are highly right-skewed, so they must be log-transformed before their distributions can be viewed.

Logarithm: taking the log of a variable is a common transformation for changing the shape of its distribution, generally used to reduce right skew. It cannot, however, be applied to zero or negative values.
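Because many projects have a pledged amount of exactly zero, a shifted log is the safe choice. A minimal sketch (the goal values below are invented):

```python
import numpy as np
import pandas as pd

goal = pd.Series([0.01, 1000.0, 30000.0, 2_000_000.0])

# log1p(x) = log(1 + x) is defined at zero, so it also handles projects
# that raised nothing; a plain log would blow up there.
goal_log = np.log1p(goal)
```

Calling `goal_log.hist(bins=50)` then draws the much less skewed distribution.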

Distribution shows:

  • Successful projects had relatively small fundraising goals compared to failed or canceled projects.
  • Canceled and failed projects have high goal amounts above the median.
  • For roughly 16% of projects, the pledged amount is around 1 USD.

STEP 3. Feature Engineering

  1. Time data: launched year, launched month, launch day, is_weekend, duration.

  2. Categorical data: create dummies for main_category and country. Categorical levels: main_category (15) and category (159) represent two different levels of categorization.

  3. Backers: the number of people supporting the project.

  4. Numerical data: generate the number of projects and the mean goal amount for each main category and subcategory, plus the differences between the main-category mean and the subcategory mean and the project's goal amount. Goal is the total fund needed to execute the project, and pledged is the amount raised so far; usd_pledged_real and usd_goal_real are USD conversions from the various currencies via an online conversion API.

  5. Text features: name is the project name, from which different text features can be extracted. From the name column, extract the length, percentage of punctuation, syllable count, character count, number of words, stopword count, capitalized-word count, and number of numeric tokens, then clean the text for plotting a word cloud.

Time: launched and deadline can be used to identify and extract time-related features.
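A few of the name-based features above can be computed with the standard library alone; the helper and key names here are illustrative (syllable and stopword counts would need an extra library such as nltk or textstat):

```python
import string

def text_features(name: str) -> dict:
    """Simple per-name text features; names and keys are illustrative."""
    words = name.split()
    return {
        "name_len": len(name),                                   # total length
        "num_words": len(words),                                 # word count
        "num_chars": len(name.replace(" ", "")),                 # non-space chars
        "punct%": 100 * sum(c in string.punctuation for c in name)
                  / max(len(name), 1),                           # punctuation share
        "capitalized": sum(w[0].isupper() for w in words),       # capitalized words
        "numerics": sum(w.isdigit() for w in words),             # numeric tokens
    }

feats = text_features("Where is Hank?")
```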

  • Clean Data Shape: (369670, 11)
  • Added Text Features Shape: (369670, 20)
  • Added Numerical Features Shape: (369670, 66)

STEP 4. Dimensionality Reduction (or Feature Selection)

1. Low Variance Filter
2. High Correlation filter
3. Backward Elimination
4. Recursive Feature Elimination

The variance, correlation, and p-value based filters did not reduce the feature set much and were not helpful; LDA did not help either. Recursive feature elimination with a RandomForest classifier did give an optimal set of features for training and testing the prediction model.

  • Optimum number of features: 12
  • Score with 12 features: 0.926990
  • Selected features: backers, usd_pledged_real, usd_goal_real, name_len, punct%, syllable_count, num_chars, avg_word, launched_year, launched_week, duration, diff_mean_category_goal
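A recursive-feature-elimination sketch with cross-validation, on synthetic stand-in data (the estimator and cv settings here are illustrative, not the notebook's exact configuration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Synthetic stand-in for the engineered feature matrix.
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=0)

# RFECV drops the weakest feature each round and picks the feature count
# with the best cross-validated score.
selector = RFECV(RandomForestClassifier(n_estimators=50, random_state=0),
                 step=1, cv=3)
selector.fit(X, y)
print("Optimum number of features:", selector.n_features_)
```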

STEP 5. Various Model Evaluation

Classification modelling:

  • Rebalance the class counts using the ADASYN oversampling technique.
  • Save the balanced set of selected feature values for later use, so that the steps above need not be re-executed.
  • Apply various models with default settings and check the accuracy / misclassification rate.
  • Predict on both the training and test sets to evaluate whether each model learned well and whether it generalizes to unseen data.
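The notebook uses `imblearn.over_sampling.ADASYN` for rebalancing. As a dependency-light sketch of the rebalancing idea, the snippet below does plain random oversampling with scikit-learn; the comment notes how ADASYN differs:

```python
import numpy as np
from sklearn.utils import resample

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)   # imbalanced: 8 majority vs 2 minority

# Upsample the minority class to the majority count. ADASYN goes further:
# it synthesizes *new* minority points near hard-to-learn examples rather
# than repeating existing rows.
minority = X[y == 1]
extra = resample(minority, replace=True, n_samples=6, random_state=0)
X_bal = np.vstack([X, extra])
y_bal = np.concatenate([y, np.ones(len(extra), dtype=int)])
```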

Execute Various Classifier Algorithms and Note Accuracy

  1. Model with Default Parameters
  2. Tuned Model
  • Before balancing, shape X: (369678, 12), y: (369678,)
  • After balancing, shape X: (584054, 12), y: (584054,)

Model with default parameters:
  Training set accuracy: 0.9999978597860214
  Test set accuracy:     0.9032368526936676

Tuned model:
  Training set accuracy: 0.9519008310450879
  Test set accuracy:     0.9077398532672437

STEP 6. Hyper-parameter Tuning using RandomizedSearchCV

  • Grid search evaluates every combination of the specified hyperparameter values and returns the best one.
  • RandomizedSearchCV instead samples a fixed number of combinations from the specified ranges or distributions, which is far cheaper when the grid is large.

Random Forest (RandomForestClassifier)

https://www.analyticsvidhya.com/blog/2015/06/tuning-random-forest-model/

  1. Parameters that improve the predictive power of the model:

     • max_features: the number of features considered when looking for the best split
     • n_estimators: the number of trees in the forest
     • min_samples_leaf: the minimum number of samples required at a leaf node

  2. Parameters that make model training easier:

     • n_jobs: -1 uses all CPUs
     • random_state: fixes the random seed so results are reproducible
     • oob_score: whether to use out-of-bag samples to estimate generalization accuracy

  3. XGBoost

  4. Light GBM
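A minimal RandomizedSearchCV sketch over the same Random Forest hyperparameters. The ranges here are deliberately tiny so the snippet runs quickly; the real search used much larger values (n_estimators up to 1000, max_depth up to 110):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the balanced feature matrix.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Small illustrative search space over the parameters discussed above.
param_dist = {
    "n_estimators": [10, 25, 50],
    "max_depth": [None, 5, 10],
    "criterion": ["gini", "entropy"],
    "bootstrap": [True, False],
}
search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_dist, n_iter=5, cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```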
Out[24]: (top 5 of the search results, key columns only)

| rank | n_estimators | max_depth | criterion | bootstrap | mean_test_score | std_test_score | mean_train_score | mean_fit_time (s) |
|------|--------------|-----------|-----------|-----------|-----------------|----------------|------------------|-------------------|
| 1 | 775  | 85   | entropy | False | 0.938230 | 0.000335 | 0.999998 | 1896.4 |
| 2 | 775  | 110  | entropy | False | 0.938163 | 0.000375 | 0.999998 | 1946.7 |
| 3 | 1000 | 85   | entropy | False | 0.938141 | 0.000197 | 0.999998 | 2461.7 |
| 4 | 1000 | None | entropy | False | 0.938125 | 0.000303 | 0.999998 | 2448.8 |
| 5 | 1000 | 60   | entropy | False | 0.938102 | 0.000382 | 0.999998 | 2439.4 |

KFold

A model can suffer from underfitting (high bias) if it is too simple, or overfit the training data (high variance) if it is too complex for the underlying data. K-fold cross-validation helps expose both, by comparing training accuracy with the average accuracy over k held-out folds.
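A minimal k-fold evaluation sketch (LogisticRegression here is only an illustrative estimator, not the notebook's model):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=200, random_state=0)

# Five held-out estimates: a large gap between these and training accuracy
# signals overfitting (high variance); uniformly low scores signal high bias.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
```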

Ensemble

The main principle behind ensemble modelling is to group weak learners together to form one strong learner: combine the decisions from multiple models to improve the overall performance.

  1. Max voting
  2. Averaging
  3. Weighted averaging
  4. Bagging, to decrease the model's variance (e.g. RandomForest)
    • Mean: 0.946, std: (+/-) 0.002 [XGBClassifier]
    • Mean: 0.943, std: (+/-) 0.003 [Bagging XGBClassifier]
  5. Boosting, to decrease the model's bias (e.g. XGBoost)
    • Mean: 0.943, std: (+/-) 0.002 [LGBMClassifier]
    • Mean: 0.946, std: (+/-) 0.004 [Boosting LGBMClassifier]
  6. Stacking, to increase the predictive power of the classifier, by trying combinations of models
    • The best stacking model is TunedETCLassifier + TunedBaggingC, with an accuracy of 0.9560
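The `[MLENS]` log line below suggests the notebook stacks with the mlens library; as a stand-in, the same idea with scikit-learn's StackingClassifier and comparable base learners (ExtraTrees + Bagging, synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, ExtraTreesClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Stacking: out-of-fold predictions of the base learners become the
# training input of a meta-learner (here LogisticRegression).
stack = StackingClassifier(
    estimators=[("et", ExtraTreesClassifier(n_estimators=50, random_state=0)),
                ("bag", BaggingClassifier(n_estimators=20, random_state=0))],
    final_estimator=LogisticRegression())
stack.fit(X_tr, y_tr)
acc = stack.score(X_te, y_te)
```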
Running Model
Training Set Accuracy:
0.9997559056777393
Test Set Accuracy:
 0.9406930557457882

Finally Selected Model

The combination of TunedETCLassifier and TunedBaggingC is selected, as it gives an accuracy of 0.9439 and is better at predicting the failed and successful instances. (See the confusion matrix below.)
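The notebook plots the confusion matrix; the underlying computation is a single sklearn call. The labels match the three classes here, but the true/predicted values below are invented for illustration:

```python
from sklearn.metrics import confusion_matrix

labels = ["canceled", "failed", "successful"]
y_true = ["failed", "failed", "successful", "canceled", "successful"]
y_pred = ["failed", "successful", "successful", "canceled", "successful"]

# Rows are true classes, columns are predicted classes, in `labels` order;
# off-diagonal cells show which classes get confused with each other.
cm = confusion_matrix(y_true, y_pred, labels=labels)
```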

Running Model
Training Set Accuracy:
0.9996297429817033
Test Set Accuracy:
 0.9423085154651532
Running Model
Training Set Accuracy:
0.9995698169903027
Test Set Accuracy:
 0.9410072681511159
[MLENS] backend: threading
Running Model
Training Set Accuracy:
0.9995205920687951
Test Set Accuracy:
 0.9414096275179563

NN - Multiclass Classifier with Four Layers


Train Predictions:
467243/467243 [==============================] - 5s 10us/step
Score - 
acc: 89.99%
Test Predictions:
116811/116811 [==============================] - 1s 10us/step
Score - 
acc: 89.94%
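The step counts above are Keras-style output. As a library-agnostic sketch of a four-layer network (input + two hidden layers + softmax output), scikit-learn's MLPClassifier on synthetic 3-class data standing in for {canceled, failed, successful}:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# 3-class toy problem; the layer sizes (64, 32) are illustrative, not the
# notebook's actual architecture.
X, y = make_classification(n_samples=400, n_classes=3, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

nn = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
nn.fit(X_tr, y_tr)
acc = nn.score(X_te, y_te)
```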

Conclusion

  • Improvements
  • Results